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Nonnegative matrix factorization (NMF) has been shown recently to be tractable under the 
separability assumption, under which all the columns of the input data matrix belong to the convex 
cone generated by only a few of these columns. Bittorf, Recht, Re and Tropp ('Factoring non- 
negative matrices with linear programs', NIPS 2012) proposed a linear programming (LP) model, 
referred to as Hottopixx, which is robust under any small perturbation of the input matrix. How- 
ever, Hottopixx has two important drawbacks: (i) the input matrix has to be normalized, and 
(ii) the factorization rank has to be known in advance. In this paper, we generalize Hottopixx in 
order to resolve these two drawbacks, that is, we propose a new LP model which does not require 
normalization and detects the factorization rank automatically. Moreover, the new LP model is 
more flexible, significantly more tolerant to noise, and can easily be adapted to handle outliers and 
other noise models. Finally, we show on several synthetic datasets that it outperforms Hottopixx 
^ ■ while competing favorably with two state-of-the-art methods. 

00 

^-j- \ tion, robustness to noise, pure-pixel assumption, hyperspectral unmixing. 

^ ! 1 Introduction 

Nonnegative matrix factorization (NMF) is a powerful dimensionality reduction technique as it auto- 
matically extracts sparse and meaningful features from a set of nonnegative data vectors: Given n non- 
negative m-dimensional vectors gathered in a nonnegative matrix M G K+ Xn and a factorization rank r, 
NMF computes two nonnegative matrices W G xr and H G W+ n such that M ^ WH. In this way, 
the columns of the matrix W form a basis for the columns of M since M(:,j) ~ X^fc=i ^( : > k)H(k,j) 
for all j. Moreover, the nonnegativity constraint on the matrices W and H leads these basis elements 
to represent common localized features appearing in the dataset as no cancellation can happen in 
the reconstruction of the original data. Unfortunately, NMF is NP-hard in general [14], and highly 
ill-posed; see |10| and the references therein. However, if the input data matrix M is r-separable, that 
is, if it can be written as 

M = W[I r , H']U, 

where I r is the r-by-r identity matrix, H' > and II is a permutation matrix, then the problem can 
be solved in polynomial time |2J. Algebraically, separability means that there exists a rank-r NMF 
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(W, H) > of M where each column of W is equal to some column of M. Geometrically, r-separability 
means that the cone generated by the columns of M has r extreme rays given by the columns of W. 
Equivalently, if the columns of M are normalized to sum to one, r-separability means that the convex 
hull generated by the columns of M has r vertices given by the columns of W; see, e.g., |13j . The 
separability assumption is far from being artificial in several applications: 

• In text mining, where each column of M corresponds to a word, separability means that, for each 
topic, there exists a word associated only with that topic; see [21 [3]. 

• In hyperspectral imaging, where each column of M equals the spectral signature of a pixel, 
separability means that, for each constitutive material ("endmembers") present in the image, 
there exists a pixel containing only that material; see and the references therein. This 
assumption is referred to as the pure-pixel assumption. 

• In blind source separation, where each column of M is a signal measure at a given point in time, 
separability means that, for each source, there exists a point in time where only that source is 
active; see O [6] and the references therein. 

Under the separability assumption, NMF reduces to identifying, among the columns of M, the 
columns of W allowing to reconstruct all columns of M. In fact, given W, the matrix H can be obtained 
by solving a linear system (or, in the noisy case, a convex optimization problem mina>o||M — Wi7||). 

In this paper, we consider the noisy variant of this problem, referred to as near- separable NMF: 

(Near-Separable NMF) Given a noisy r-separable matrix M = M + N with M = WH = 
W[I r ,H']U where W and H' are nonnegative matrices, II is a permutation matrix and N 
is the noise, find a set /C of r indices such that M(:,IC) ~ W. 

Several algorithms have been proposed to solve this problem [21 El El El HH [13]. In this paper, our 
focus is on the linear programming (LP) model proposed by Bittorf, Recht, Re and Tropp [4] and 
referred to as Hottopixx. It is described in the next section. 

Remark 1 (Nonnegativity of M). In the formulation of near- separable NMF, the input data matrix 
M is not necessarily nonnegative since there is not restriction on the noise N . In fact, we will only 
need to assume that the noise is bounded, but otherwise it is arbitrary; see Section^ 



1.1 Hottopixx, a Linear Programming Model for Near-Separable NMF 

A matrix M is r-separable if and only if 

m = wh = w[i r , H']n = [w, wh']u = [w, wh'] ( Ir H ) n = mi , (i) 

v v ' 

for some permutation II and some matrix H' > 0. The matrix X° is an n-by-n nonnegative matrix 
with (n — r) zero rows such that M = MX . Assuming the columns of M sum to one, the columns 
of W and H' have sum to one as well. Based on these observations, Bittorf, Recht, Re and Tropp [1] 
proposed to solve the following optimization problem in order to approximately identify the columns 
of the matrix W among the columns of the matrix M = M + N where ||iV||i = maxj||iV(:, < e : 

min p T diag(X) 

Xm nxu 

such that \\M - MX\\\ < 2e, 

tr(X) = r, (2) 
X(i, i) < 1 for all i, 
X(i,j) < X(i,i) for all 
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where p is any n-dimensional vector with distinct entries; see Algorithm [TJ 



Algorithm 1 Hottopixx - Extracting Columns of a Noisy Separable Matrix by Linear Programming 

Input: A normalized noisy r-separable matrix M = WH + A E M™ xn , the factorization rank r, the 

noise level ||A||i < e and a vector p G M n with distinct entries. 
Output: A matrix W such that W ~ W (up to permutation). 

1: Find the optimal solution X* of (j2J) where p has distinct entries. 

2: Let K be the index set corresponding to the r largest diagonal entries of X* . 

3: Set W = M(:,/C). 



Intuitively, the LP models ([2]) assigns a total weight r to the n diagonal entries of the variable X in 
such a way that M can be well approximated using nonnegative linear combinations of columns of M 
corresponding to positive diagonal entries of X. Moreover, the weights used in the linear combinations 
cannot exceed the diagonal entries of X since X(:,j) < diag(X) for all j. There are several drawbacks 
in using the LP model ([2]) in practice: 

1. The factorization rank r has to be chosen in advance. In practice the true factorization rank is 
often unknown, and a "good" factorization rank for the application at hand is typically found by 
trial and error. Therefore the LP above may have to be resolved many times. 

2. The columns of the input data matrix have to be normalized in order to sum to one. This may 
introduce significant distortions in the dataset and lead to poor performance; see |13| where some 
numerical experiments are presented. 

3. The noise level ||iV||i < e has to be estimated. 

4. One has to solve a rather large optimization problem with n 2 variables, so that the model cannot 
be used directly for huge-scale problems. 

It is important to notice that there is no way to getting rid of both drawbacks 2. and 3. In fact, in 
the noisy case, the user has to indicate either 

• The factorization rank r, and the algorithm should find a subset of r columns of M as close as 
possible to the columns of W, or 

• The noise level e, and the algorithm should try to find the smallest possible subset of columns of 
M allowing to approximate M up to the required accuracy. 

1.2 Contribution and Outline of the Paper 

In this paper, we generalize Hottopixx in order to resolve drawbacks 1. and 2. above. More precisely, 
we propose a new LP model which has the following properties: 

• Given the noise level e, it detects the number r of columns of W automatically; see Section [21 

• It can be adapted to dealing with outliers; see Section [3j 

• It does not require column normalization; see Section [5] 

• It is significantly more tolerant to noise than Hottopixx. In fact, we propose a tight robustness 
analysis of the new LP model proving its superiority (see Theorems [1] and [2J . This is illustrated 
in Section [5] on several synthetic datasets, where the new LP model is shown to outperform 
Hottopixx while competing favorably with two state-of-the-art methods, namely the successive 
projection algorithm (SPA) from [1] [11] and the fast conical hull algorithm (XRAY) from [13J. 

1 Strictly speaking, ([2]) is not a linear program but it can be reformulated as one. 
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2 Detecting the Factorization Rank Automatically 

In this section, we analyze the following LP model: 

min p T diag(X) 

such that \\M - MX\\x < pe, (3) 
X(i, i) < 1 for all i, 
X(i,j) < X(i,i) for all 

where p has positive entries and p > is a parameter. We also analyze the corresponding near- 
separable NMF algorithm (Algorithm [5]) with an emphasis on robustness. The LP model ([3]) is 

Algorithm 2 Extracting Columns of a Noisy Separable Matrix by Linear Programming 

Input: A normalized noisy r-separable matrix M = WH + N £ W^ xn , the noise level \\N\\i < e, a 

parameter p > and a vector p € W 1 with positive distinct entries. 
Output: An m-by-r matrix W such that W ~W (up to permutation). 

1: Compute an optimal solution X* of Q. 
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Let fC be the index set corresponding to the diagonal entries of X* larger than 1 — mm ( 1 'P) 
W = M(:,1C). 



exactly the same as (|2|) except that the constraint tr(A) = r has been removed, and that there is an 
additional parameter p. Moreover, the vector p € lR n in the objective function has to be positive, or 
otherwise any diagonal entry of an optimal solution of (J3j) corresponding to a negative entry of p will 
be equal to one (in fact, this reduces the objective function the most while minimizing \\M — MA||i). 
A natural value for the parameter p is two, as in the original LP model ([2]), so that the matrix X° 
in Equation ([TJ identifying the set of columns of M corresponding to the columns of W is feasible. 
However, the model ([3]) is feasible for any p > since the identity matrix of dimension n (that is, 
X = I n ) is always feasible. Hence, it is not clear a priori which value of p should be chosen. The 
reason we analyze the LP model j3]) for different values of p is two-fold: 

• First, it shows that the LP model is rather flexible as it is not too sensitive to the right-hand 
side of the constraint \\M — MX\\i < pe. In other terms, the noise level does not need to be 
known precisely for the model to make sense. This is a rather desirable property as, in practice, 
the value of e is typically only known/evaluated approximately. 

• Second, we observed that taking p smaller than two gives in average significantly better results 
(see Section[5]for the numerical experiments). Our robustness analysis of Algorithm [2] will suggest 
that the best choice is to take p = 1. 

In this section, we prove that the LP model ([3]) allows to identifying approximately the columns 
of the matrix W among the columns of the matrix M for any p > 0, given that the noise level e is 
sufficiently small (e will depend on the value p); see Theorems [TJ [2] and [3l 

Before stating the robustness results, let us define the conditioning of a nonnegative matrix W 
whose columns sum to one: 

k = min min k) — W(:, fC)x\\i, where K = {1, 2, . . . , r}\{fc}, 

and the matrix W is said to be K-robustly conical. The parameter < k < 1 tells us how well the 
columns of W are spread in the unit simplex. In particular, if k = 1, then W contains the identity 
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matrix as a submatrix (all other entries being zeros) while, if k = 0, then at least one of the columns 
of W belongs to the convex cone generated by the others. Clearly, the better the columns of W are 
spread across the unit simplex, the less sensitive is the data to noise. For example, e < l| is a necessary 
condition to being able to distinguish the columns of W [9]. 



2.1 Robustness Analysis without Duplicates and Near Duplicates 

In this section, we assume that the columns of W are isolated (that is, there is no duplicate nor 
near duplicate of the columns of W in the dataset) hence more easily identifiable. This type of 
margin constraint is typical in machine learning and is equivalent to bounding the entries of H' 
in the expression M = W[I r , H']H, see Equation JT]). In fact, for any 1 < k < r and h € KIj_ with 
maxj h(i) < < 1, we have that 

WW^Q-WhW! = ||(1 — h(k))W(:, k) — W(:,JC)h(JC)\\i 
>(l-/3) min \\W(:, k) - W(:,K.)y\\i 

> (1-/3K 

where /C = {1, 2, . . . , r}\{/c}. Hence maxy H[- < f3 implies that all data points are at distance at least 
(1 — (3)n of any column of W. Under this condition, we have the following robustness result: 

Theorem 1. Suppose M = M + N where the columns of M sum to one, M = WH admits a rank-r 
separable factorization of the form ([1]) with H > 0, maxjj H-j < (3 < 1 and W K-robustly conical with 
k>0, and \\N\\i < e. If 

< k(1 - g) min(l,p) 
5(p + 2) 

then Algorithm^ extracts a matrix W G W nxr satisfying \\W — W(:,P)\\i < e for some permutation P. 

Proof See Appendix |Al □ 

Remark 2 (Noiseless case). When there is no noise (that is, N = and e = 0), duplicates and near 
duplicates are allowed in the dataset; otherwise e > implying that /3 < 1 hence the columns of W are 
isolated. 

Remark 3 (A slightly better bound). The bound on the allowable noise in Theorem^ can be slightly 
improved, so that under the same conditions we can allow a noise level of 

k(1 — 13) min(l, p) 



4Qo + 2) + k(1 -P)mhx{l,p) 
However, the scope for substantial improvements is limited, as we will show in Theorem 
Remark 4 (Best choice for p). Our analysis suggests that the best value for p is one. In fact, 

min(l,p) 

argmax ^° WTW = 

In this particular case, the upper bound on the noise level to guarantee recovery is given by e < jjg 

while, for p = 2, we have e < K ^\q^ ■ The choice p = 1 is also optimal in the same sense for the bound 
in the previous remark. We will see in Section 0, where we present some numerical experiments, that 
choosing p = 1 works remarkably better than p = 2. 
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It was proven in [9] that, for Algorithm [T] to extract the columns of W under the same assumptions 
as in Theorem [TJ it is necessary that 

e < 7 7\7i o\ ; , for an y r > 3 and /3 < 1, 

(r-l)(l-/3) + l 

while it is sufficient that e < 9(^+1) ■ Therefore, if there are no duplicate nor near duplicate of the 
columns of W in the dataset, 

Algorithm^ is more robust than Hottopixx (Algorithm [ip.- in fact, unlike Hottopixx, its 
bound on the noise to guarantee recovery (up to the noise level) is independent of the 
number of columns of W. Moreover, given the noise level, it detects the number of columns 
of W automatically. 

The reason for the better performance of Algorithm [2] is the following: for most noisy r-separable 
matrices M, there typically exist matrices X' satisfying the constraints of ([3|) and such that tr(X') < r; 
see, e.g., the constructions in [9j. Therefore, the remaining weight (r — tr(X')) will be assigned by 
Hottopixx to the diagonal entries of X' corresponding to the smallest entries of p, since the objective is 
to minimize p T diag(X'). These entries are unlikely to correspond to columns of W (in particular, ifp in 
chosen by an adversary as in |9j). We observed that when the noise level e increases, r-tr(X') increases 
as well, hence it becomes likely that some columns of W will not be identified. This explains why the 
LP model enforcing the constraint tr(X) = r is less robust, and why its bound on the noise depends 
on the factorization rank r. Moreover, the LP ([2]) is also much more sensitive to the parameter e than 
the model LP ©: 

• For e sufficiently small, it becomes infeasible, while, 

• for e too large, the problem described above is worsened: there are matrices X' satisfying the con- 
straints of ([3]) and such that tr(X') <C r, hence Hottopixx will perform rather poorly (especially 
in the worst-case scenario, that is, if the problem is set up by an adversary). 

To conclude this section, we prove that the bound on the noise level e to guarantee the recovery 
of the columns of W by Algorithm [2] given in Theorem [1] is tight up to some constant multiplicative 
factor. 

Theorem 2. For any fixed p > and (3 < 1, the bound on e in Theorem [7] is tight up to a mul- 
tiplicative factor. In fact, under the same assumptions on the input matrix M , it is necessary that 
€ < K(i-/3)min(i,p) Algorithm^ to extract a matrix W G flj mxr satisfying \\W — W"(:,P)||i < e for 
some permutation P. 

Proof. See Appendix [Bj □ 
For example, Theorem [2] implies that, for p = 1, the bound of Theorem[T]is tight up to a factor -y. 

2.2 Robustness Analysis with Duplicates and Near Duplicates 

In case there are duplicates and near duplicates in the dataset, it is necessary to apply a post-processing 
to the solution of ([3]). In fact, although we can guarantee that there is a subset of the columns of M 
close to each column of W whose sum of the corresponding diagonal entries of an optimal solution 
of j3|) is large, there is not guarantee that the weight will be concentrated only in one entry. It is then 
required to apply some post-processing based on the distances between the data points to the solution 
of (|3|) (instead of simply picking the r indices corresponding to its largest diagonal entries) in order to 
obtain a robust algorithm. In particular, using Algorithm [5] to post-process the solution of ([2]) leads 
to a more robust algorithm than Hottopixx [9J. Note that pre-processing would also be possible [8j[2]. 

Therefore, we propose to post-process an optimal solution of (J3J) with Algorithm UJ see Algorithm [3J 
for which we can prove the following robustness result: 
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Theorem 3. Let M = WH be an r-separable matrix whose columns sum to one of the form ([T]) with 
H > and W K-robustly conical. Let also M = M + N with \\N\\i < e. // 

LdK 

€ < 99(r + l)' 

where u = vaxni^j\\W(:,i) — W(:,j)\\i, then Algorithm^ extracts a matrix W such that 
\\W — W(:,P)\\i < 5 = 49(r + 1) — h 2e, for some permutation P. 

K 

Proof. See Appendix O (f° r simplicity, we only consider the case p = 2; the proof can be generalized 
for other values of p > in a similar way as in Theorem [I]) . □ 

This robustness result is the same as for the algorithm using the optimal solution of ([2]) post- 
processed with Algorithm 3] [9] . Hence, in case there are duplicates and near duplicates in the dataset, 
we do not know if Algorithm [3] is more robust, although we believe the bound for Algorithm [3] can be 
improved (in particular, that the dependence in r can be removed), this is a topic for further research. 

Algorithm 3 Extracting Columns of a Noisy Separable Matrix by Linear Programming 

Input: A normalized r-separable matrix M = WH + N, the noise level ||iV||i < e and a vector p £ WL 

with positive distinct entries. 
Output: An m-by-r matrix W such that W ~W (up to permutation). 

1: Compute the optimal solution X* of ([3]) where p = e is the vector of all ones and p = 2. 

2: K — post-processing ^M, diag(A*), ; 

3: W = M(:,K.)) 



Remark 5 (Choice of p). Although Theorem^ requires the entries of the vector p to be all ones, we 
recommend to take the entries of p distinct, but close to one. This allows the LP Q to discriminate 
better between the duplicates hence Algorithm^ does not necessarily have to enter the post-processing 
loop. We suggest to use p(i) ~ 1 + U(—a,a) for all i, where a <sC 1 and U(a,b) is the uniform 
distribution in the interval [a, b] . 



3 Handling Outliers 



Removing the rank constraint has another advantage: it allows to deal with outliers. If the dataset 
contains outliers, the corresponding diagonal entries of an optimal solution X* of ([3]) will have to be 
large (since outliers cannot be approximated well with convex combinations of points in the dataset). 
However, under some reasonable assumptions, outliers are useless to approximate data points, hence 
off-diagonal entries of the rows of X* corresponding to outliers will be small. Therefore, one could 
discriminate between the columns of W and the outliers by looking at the off-diagonal entries of X* . 
This result is closely related to the one presented in Section 3]. For simplicity, we consider in this 
section only the case where p = 2 and assume absence of duplicates and near-duplicates in the dataset; 
the more general case can be treated in a similar way. 

Let the columns of T € ]R mx * be t outliers added to the separable matrix W[I r ,H'] along with 
some noise to obtain 



M = M + N where M = [W, T]H = [W, T, WH'] U 



I r rxt H' 



n, 



(4) 



which is a noisy r-separable matrix containing t outliers. We propose Algorithm [S] to approximately 
extract the columns of W among the columns of M. 
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Algorithm 4 Post-Processing - Clustering Diagonal Entries of X* [9] 
Input: A matrix M £ R mxn , a vector x € R" , e > 0, and possibly a factorization rank r. 
Output: A index set K* with r indices so that the columns of M(:,K*) are centroids whose corre- 
sponding clusters have large weight (the weights of the data points are given by x). 

= — mj||i for 1 < i,j < n; 
if r is not part of the input then 

else 
end if 

]C = K* = [k\ x(k) > and v = u* = max (2e, min {(iii) | D(iJ)>0} D(i, j)) ; 

while \IC\ < r and v < maxj j D(i,j) do 
5j = {j | -D(i, j) < for 1 < i < n; 

/C = 0; 

while maxi<j< n w(i) > do 

/c = argmaxw(i); /C <— /C U {/c}; 

For all 1 < i < n and j £ S^U Si : w{i) <— u>(i) — x{j); 
end while 
if \K\ > |/C*| then 

£* =/C; z^ = ^*; 
end if 
i/ 2z/; 
end while 

% Safety procedure in case the conditions of Theorem are not satisfied: 
if \K.*\ < r then 

d = maxjj D(i,j); 

Si = {j \D{i,j) <i/*}for l<i<n; 
™W = T^jeSt for 1 < i < n; 
K* = 0; 

while \K*\ < r do 

& = argmaxw(i); K* ^— /C* U {/c}; 

For all 1 < i < n, and j € Sk U 5j : ^— u;(i) — ^ d D ^ l '^ ^j x(j); 
w{k) ^ 0; 
end while 
end if 

Algorithm 5 Extracting Columns of a Noisy Separable Matrix with Outliers by Linear Programming 

Input: A normalized noisy r-separable matrix M = [W, T, WH']U + N G R™ xn with outliers, the 

noise level ||iV||i < e and a vector p G R n with positive distinct entries and p = 2. 
Output: An m-by-r matrix W such that W ~W (up to permutation). 



Compute the optimal solution X* of ([3]) where p has distinct positive entries. 
Let /C = {1 < k < n \ X*(k,k) > \ and ||Jf*(fc, :)||i -X*(k,k) > ±}. 
W = M{:,K). 
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In order for Algorithm [5] to extract the correct set of columns of M, the off-diagonal entries of the 
rows corresponding to the columns of T (resp. columns of W) must be small (resp. large). This can 
be guaranteed using the following conditions (see also Theorem [3] below) : 

• The angle between the cone generated by the columns of T and the columns space of W is 
positive. More precisely, we will assume that for all 1 < k < t 

min ||Tx — Wy||i > rj > 0. (5) 

xEM t + ,x(k) = l 

In fact, if a nonnegative linear combination of outliers (that is, Tx with x > 0) belongs to the 
column space of W, then some data points can usually be reconstructed using a non-zero weight 
for these outliers (it suffices that some data points belong to the convex hull of some columns of 
W and that linear combination of outliers). 

• The matrix \W, T] is robustly conical, otherwise some columns of T could be reconstructed using 
other columns of T whose corresponding rows could hence have large off-diagonal entries. 

• Each column of W is necessary to reconstruct at least one data point, otherwise the off-diagonal 
entries of the row of X* corresponding to that 'useless' column of W will be small, possibly equal 
to zero, and it cannot be distinguished from an outlier. More formally, for all 1 < k < r, there 
is a least one data point M(:,j) = WH(:,j) ^ W(:, k) such that 

min \\M(:, j)-Tx-W(:, K)y ||i > S, where K, = {1,2, ... ,r}\{A;}. (6) 

x>0,y>0 

If Equation ([5]) holds, this condition is satisfied for example when conv(VU) is a simplex and some 
points lie inside that simplex (it is actually satisfied if and only if each column of W define with 
other columns of W a simplex containing at least one data point in its interior). 

These conditions allow to distinguish the columns of W from the outliers using off-diagonal entries 
of an optimal solution X* of 

Theorem 4. Suppose M = M + N where the columns of M sum to one, M = [W, T]H has the 
form (0J with H > 0, maxjj H'- < j3 < 1 and [W,T] K-robustly conical, and \\N\\i < e. Suppose also 
that M , W and T satisfy Equations ([5]) and ([6]) for some r/ > and 5 > 0. // 

e < — 7- where v = mm(K, ??, o ) , 

20(n — 1) 

then Algorithm^ extracts a matrix W G ]£ mxr satisfying \\W — W(:,P)\\i < e for some permutation P. 

Proof. See Appendix [Dj □ 

Unfortunately, the factor is necessary because a row of X* corresponding to an outlier could 
potentially be assigned weights proportional to e for all off-diagonal entries. For example, if all data 
points are perturbed in the direction of an outlier, that is, N(:,j) = eT(:,k) for all j and for some 
1 < k < t, then we could have Y2jjtkX(k,j) = (n — l)O(e) hence it is necessary that e < Oin- 1 ) 
(although it is not likely to happen in practice). A simple way to improve the bound is the following: 

• Identify the vertices and outliers using K. = {1 < k < n | X*(k,k) > \) (this only requires 
e < 20 ' Theorem [1]). 

• Solve the linear program Z* = argmin^x)!!-^ — 
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• Use the sum of the rows of Z* (instead of X*) to identify the columns of W. 
Following the same steps as in the proof of Theorem [5J the bound for e for the corresponding algorithm 



becomes e < 



20(r+t-l) ' 



Remark 6 (Number of outliers). Algorithmic does not require the number of outliers as an input. 
Moreover, the number of outliers is not limited hence our result is stronger than in where the 
number of outliers cannot exceed m — r (because T needs to be full rank, while we only need T to be 
robustly conical and the cone generated by its columns define a wide angle with the column space ofW). 

Remark 7 (Hottopixx and outliers). Replacing the constraint tr(X) = r with tr(X) = r + t (r is 
the number of columns ofW and t is the number of outliers) in the LP model (j5J) allows to deal with 
outliers. However, the number of outliers plus the number of columns of W (that is, r + t) has to be 
estimated, which is rather impractical. 



4 Avoiding Column Normalization 

In order to use the LP models (J2|) and ([3]), normalization must be enforced which may introduce 
significant distortions in the dataset and lead to poor performances [13j. If M is r-separable but its 
columns do not sum to one, we still have that 



M = W[I r ,H']U = [W, WH']U = [W, WH'\ 



H' 



0(n— r)xr 0(n— r) X (n— r) 



n = mx°. 



However, the constraints X(i,j) < X(i, i) for all i,j in the LP's ([2]) and ([3]) are not necessarily satisfied 
by the matrix X°, because the entries of H' can be arbitrarily large. 

Let us denote M Q the original unnormalized noisy data matrix, and its normalized version M, with 



M(:,j) 



M (:,j) 



for all j. 



\\M (:J)\\i 

Let us also rewrite the LP ([3]) in terms of M Q instead of M using the following change of variables 



Xi 



l|M (:,i,||i 

Note that Ya = Xa for all i. We have for all j that 



(7) 



M(:,i)- J3M(:,i)-X< 3 - 



M (:,j) 



M (:,i) \\M (:,i)\\ 
M (:,j)\\i y ||MoO,i)lli \\Mo(:J)\\ 



E 



-Y 



|M (:,j)||i 



M (:,j)-J2M (:,i)Y h 



which proves that the following LP 



nun /diag(Y) such that ||M (:,j) - M r(:,j)||i < pe||Af (:,i)||i for all j, (8) 

where 

y = {Y £Rl xn \Y(i,i)< I for alii, and ||M (:, i)\\xY(i, j) < ||M (:,i)||ay(i, i) for all (9) 
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is equivalent to the LP ([3]). This shows that the LP ([3]) looks for an approximation M Q Y of M Q with 
small relative error, which is in general not desirable in practice. For example, a zero column to which 
some noise is added will have to be approximated rather well, while it does not bring any valuable 
information. Similarly, the columns of M with large norms will be given relatively less importance 
while they typically contain a more reliable information (e.g., in document datasets, they correspond 
to longer documents). 

It is now easy to modify the LP ([8]) to handle other noise models. For example, if the noise added 
to each column of the input data matrix is independent of its norm, then one should rather use the 
following LP trying to find an approximation M a Y of M a with small absolute error: 

min p T diag(Y) such that \\M Q - M Y\\i < pe. (10) 

Remark 8 (Other noise models). Considering other models depending is also possible: one has to 
replace the constraint \\M Q — M Q Y\\i < pe with another appropriate constraint. For example, using any 
d. q -norm with q > 1 leads to efficiently solvable convex optimization programs fl<$ . that is, using 

\\M {:,j)-M Y(:,j)\\ q < pe, for allj. 

Another possibility is to assume that the noise is distributed among all the entries of the input matrix 
independently (that is, some columns could have been contaminated with more noise than others) and 

one could use instead ^J^i j (m o — M Y^ < pe, e.g., \\M — M Y\\p < pe for Gaussian noise (where 

\\.\\f * s the Frobenius norm of a matrix with q = 2). 

5 Numerical Experiments 

In this section, we present some numerical experiments in which we compare our new LP model (fT0j) 
with Hottopixx and two other state-of-the-art methods. First we describe a practical twist to Algo- 
rithm [H which we routinely apply in the experiments to LP-based solutions. 

5.1 Post-Processing of LP solutions 

Recall that the LP-based algorithms return a nonnegative matrix X whose diagonal entries indicate 
the importance of the corresponding columns of the input data matrix M . As explained earlier, there 
are several ways to extract r columns from M using this information, the simplest being to select the 
columns corresponding to the r largest diagonal entries of X [3]. Another approach is to take into 
account the distances between the columns of M and cluster them accordingly; see Algorithm [3J In 
our experiments we have not observed that one method dominates the other (although in theory, when 
the noise level is sufficiently small, Algorithm [3J is more robust [9j). Therefore, the strategy we employ 
in the experiments below selects the best solution out of the two post-processing strategies based on 
the residual error, see Algorithm [6l 

5.2 Algorithms 

In this section, we compare the following near-separable NMF algorithms: 

1. Hottopixx [3J. Given the noise level ||iV||i and the factorization rank r, it computes the optimal 
solution X* of the LP ([2]) (where the input matrix M has to be normalized) and returns the indices 
obtained using Algorithm El The vector p in the objective function was randomly generated using 
the randn function of Matlab. The algorithm of Arora et al. [2] was shown to perform worse than 
Hottopixx [3J hence we do not include it here (moreover, it requires an additional parameter a 
related to the conditioning of W which is difficult to estimate in practice). 
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Algorithm 6 Hybrid Post-Processing for LP-based Near-Separable NMF Algorithms 



Input: A matrix M € W mxn , a factorization rank r, a noise level e, and a vector of weight 
Output: An index set K. such that min#>o||M — M{:,1C)H\\f is small. 

1: % Greedy approach 

2: /Ci is the set of the r largest indices of x; 

3: % Clustering using Algorithm^ 

4: K,2 = Algorithm H/m, x, e, r^j ; 

5: % Select the better of the two 

6: K = argmin^g^^j va.\n H > Q \\M - M(:,K)H\\ 2 F ; 



2. SPA [1]. The successive projection algorithm (SPA) extracts recursively r columns of the input 
normalized matrix M as follows: at each step, it selects the column with maximum li norm, 
and then projects all the columns of M on the orthogonal complement of the extracted column. 
This algorithm was proved to be robust to noise jllj . (Note that there exist variants where, at 
each step, the column is selected according to other criteria, e.g., any £ p norm with 1 < p < +oo. 
This particular version of the algorithm using £2 norm actually dates back from modified Gram- 
Schmidt with column pivoting, see and the references therein.) SPA was shown to perform 
significantly better on several synthetic datasets than Hottopixx and several state-of-the-art 
algorithms from the hyperspectral image community (these algorithms are based on the 
pure- pixel assumption which is equivalent to the separability assumption, see Introduction). 

3. XRAY |13| . In |13j . several fast conical hull algorithms are proposed. We use in this paper 
the variant referred to as max, because it performs in average the best on synthetic datasets. 
Similarly as SPA, it recursively extracts r columns of the input unnormalized matrix M Q : at 
each step, it selects a column of M Q corresponding to an extreme ray of the cone generated 
by the columns of M a , and then projects all the columns of M on the cone generated by the 
columns of M Q extracted so far. XRAY was shown to perform much better than Hottopixx and 
similarly as SPA on synthetic datasets (while performing better than both for topic identification 
in document datasets as it does not require column normalization). However, it is not known 
whether XRAY is robust to noise. 

4. LP pop with p = 1,2. Given the noise level ||iV||i, it computes the optimal solution X* of 
the LP (|10p and returns the indices obtained with the post-processing described in Algorithm [6j 
(Note that we have also tried p = \ which performs better than p = 2 but slightly worse than 
p = 1 in average hence we do not display these results here.) 

Table [1] gives the following information for the different algorithms: computational cost, memory 
requirement, parameters and if column normalization of the input matrix is necessary. 





Flops 


Memory 


Parameters 


Normalization 


Hottopixx [3] 


(mn 2 ) 


O (mn + n 2 ) 


W\u r 


Yes 


SPA [mn] 


2mnr + O (mr 2 ) 


O {mn) 


r 


Yes 


XRAY [E] 


O {mnr) 


O {mn) 


r 


No 


LP (HDJ 


£1 (mn 2 ) 


O {mn + n 2 ) 


\\N\\i 


No 



Table 1: Comparison of robust algorithms for near-separable NMF for a dense m-by-n input matrix. 
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The LP have been solved using the IBM ILOG CPLEX Optimized on a standard Linux box. 
Because of the greater complexity of the LP-based approaches (formulating ([2]) and (imp as LP's 
requires n 2 + mn variables), the size of the input data matrices allowed on a standard machine is 
limited, roughly mn 2 ~ 10 6 (for example, on a two-core machine with 2.99GHz and 2GB of RAM, 
it already takes about one minute to process a 100-by-100 matrix using CPLEX). In this paper, we 
focus on the robustness performance of the different algorithms and only compare them on synthetic 
datasets. Comparison on large-scale real-world datasets would require dedicated implementations, such 
as the parallel first-order method proposed in [3] for the LP ([2]), and is a topic for further research. 
The code for all algorithms is available at https://sites.google.com/site/nicolasgillis/code. 

5.3 Synthetic Datasets 

With the algorithms above we have run a benchmark with certain synthetic datasets particularly 
suited to investigate the robustness behaviour under influence of noise. In all experiments the problem 
dimensions are fixed to m = 50, n = 100 and r = 10. We conducted our experiments with six different 
data models. As we will describe next, the models differ in the way the factor H is constructed 
and the sparsity of the noise matrix N . Given a desired noise level e, the noisy r-separable matrix 
M = M + N = WH + A is generated as follows: 

The entries of W are drawn uniformly at random from the interval [0, 1] (using Matlab's rand 
function). Then the columns of W are normalized to sum to one. 

The first r columns of H are always taken as the identity matrix to satisfy the separability assump- 
tion. The remaining columns of H and the noise matrix A are generated in two different ways (similar 
toUU): 

1. Dirichlet. The remaining 90 columns of H are generated according to a Dirichlet distribution 
whose r parameters are chosen uniformly in [0, 1] (the Dirichlet distribution generates vectors on 
the boundary of the unit simplex so that ||JJ(:, =lforallj). Each entry of the noise matrix 
A is generated following the normal distribution A/*(0, 1) (using the randn function of Matlab). 

2. Middle Points. The r ( r ~ 1 ^ = 45 next columns of H resemble all possible equally weighted convex 
combinations of pairs from the r leading columns of H. This means that the corresponding 45 
columns of M are the middle points of pairs of columns of W. The trailing 45 columns of H are 
generated in the same way as above, using the Dirichlet distribution. No noise is added to the 
first r columns of M, that is, N(:, 1 : r) = 0, while all the other columns are moved toward the 
exterior of the convex hull of the columns of W using 

N(:,j) = M(:,j) -w, for r + 1 < j < n, 

where it; is the average of the columns of W (geometrically, this is the vertex centroid of the 
convex hull of the columns of W). 

We combine these two choices for H and A with three options that control the pattern density 
of A, thus yielding a total of six different data models: 

1. Dense noise. Leave the matrix A untouched. 

2. Sparse noise. Apply a mask to A such that roughly 75% of the entries are set to zero (using the 
density parameter of Matlab's sprand function). 

3. Pointwise noise. Keep only one randomly picked non-zero entry in each nonzero column of A. 

Finally we scale the resulting matrix A by a scalar such that ||A||i = e. In order to avoid a bias 
towards the natural ordering, the columns of M are permuted at random in a last step. 

2 Available for free at http://www-01.ibm.com/software/integration/optimization/cplex-optimizer/ for 

academia. 
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5.4 Error Measures and Methodology 

Let /C be the set of indices extracted by an algorithm. In our comparisons, we will use the following 
two error measures: 

• Index recovery: percentage of correctly extracted indices in tC (recall that we know the indices 
corresponding to the columns of W). 

• l\ residual norm: Denote by \\A\\ S = \ a ij\ the t\ norm for matrices. We measure the relative 
l\ residual by 

. \\M-M(:,JC)H\\ S 
1 — mm ~ . 

H>o \\M\\ S 

Note that both measures are between zero and one, one being the best possible value, zero the worst. 

The aim of the experiments is to display the robustness of the algorithms from Section T5.2I applied 
to the data sets described in the previous section under increasing noise levels. For each data model, 
we ran all the algorithms on the same randomly generated data on a predefined range of noise levels e. 
For each such noise level, 25 data sets were generated and the two measures are averaged over this 
sample for each algorithm. 

5.5 Results 

Figures [1] and [2] display the results for the three experiments of "Dirichlet" and "Middle Points" types 
respectively. In all experiments, we observe that 

• The new LP model (|10p is significantly more robust to noise than Hottopixx, which confirms our 
theoretical results; see Section [2.11 

• The variant of LP (|10p with p = 2 is less robust than with p = 1, as suggested by our theoretical 
findings from Section [2.11 

• SPA and XRAY perform, in average, very similarly. 

Comparing the three best algorithms (that is, SPA, XRAY and LP (jlOp with p = 1), we have that 

• In case of "dense" noise, they give comparable results; although LP (jlOp with p = 1 performs 
slightly worse for the "Dirichlet" type, and slightly better for the "Middle Points" type. 

• In case of "sparse" noise, LP (fTU|) with p = 1 performs consistently better then SPA and XRAY: 
for all noise levels, it identifies correctly more columns of W and the corresponding NMF's have 
smaller l\ residual norms. 

• In case of "pointwise" noise, LP (|10p with p = 1 outperforms SPA and XRAY. In particular, for 
high noise level, it is able to extract correctly almost all columns of W while SPA and XRAY 
can only extract a few for the "Dirichlet" type (performing as a guessing algorithm since they 
extract correctly only r/n = 10% of the columns of W), or none for the "Middle Points" type. 

Note that LP (fTU|) with p = 2 also performs consistently better then SPA and XRAY in case of 
"pointwise" noise. 

Remark 9. For the "Middle Points" experiments and for large noise levels, the middle points of the 
columns of W become the vertices of the convex hull of the columns of M (since they are perturbed 
toward the outside of the convex hull of the columns ofW). Hence, near- separable NMF algorithms 
should not extract any original column of W . However, the index measure for LP (|10p with p = 2 
increases for larger noise level (although the l\ residual measure decreases); see Figure^ It is difficult 
to explain this behavior because the noise level is very high (close to 100%) hence the separability 
assumption is far from being satisfied and it is not clear what the LP (|10p does. 
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Index recovery (Dirichlet / dense noise) 




noise level 



Index recovery (Dirichlet / sparse noise) 




noise level 



L 1 residual (Dirichlet / dense noise) 




0.1 1 ' ' ' ' — ' ' ' ' — 

10" z 10"' 10° 

noise level 
L residual (Dirichlet / sparse noise) 




i , , — , — , — , — I, , , — , — , — , — i i 

10~ 2 10"' 10° 

noise level 



Figure 1: Comparison of near-separable NMF algorithms on "Dirichlet" type data sets. From left to 
right: index recovery and l\ residual. From top to bottom: dense noise, sparse noise and pointwise 
noise. 
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Index rec overy (middle points / dense noise) L residual (middle points / dense noise) 




noise level noise level 



Index recovery (middle points / sparse noise) L residual (middle points / sparse noise) 




noise level noise level 



Figure 2: Comparison of near-separable NMF algorithms on "Middle Points" type data sets. From left 
to right: index recovery and l\ residual. From top to bottom: dense noise, sparse noise and pointwise 
noise. 
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D / dense 


D/sparse 


D/pw 


MP/dense 


MP/sparse 


MP/pw 


Hottopixx 


2.5 


2.5 


3.6 


4.4 


4.3 


4.2 


SPA 


<0.1 


<0.1 


<0.1 


<0.1 


<0.1 


<0.1 


XRAY 


<0.1 


<0.1 


<0.1 


<0.1 


<0.1 


<0.1 


LP {TDK, p = 1 


20.5 


34.1 


39.0 


52.5 


88.1 


41.4 


LP (HQD, p = 2 


10.5 


12.3 


16.0 


32.5 


56.9 


27.4 



Table 2: Average computational time in seconds for the different algorithms and data models. (D stands 
for Dirichlet, MP for middle points, pw for pointwise.) 





D /dense 


D/sparse 


D/pw 


MP/dense 


MP / sparse 


MP/pw 


Hottopixx 


0.014 


0.018 


0.016 


0.016 


0.018 


0.015 


SPA 


0.220 


0.154 


0.052 


0.077 


0.071 


0.032 


XRAY 


0.279 


0.154 


0.052 


0.083 


0.071 


0.032 


LP (HD}, p = l 


0.279 


0.195 


0.197 


0.083 


0.098 


0.178 


lp CED, p = 2 


0.137 


0.121 


0.141 


0.055 


0.055 


0.075 



Table 3: Index recovery robustnesss: Largest noise level ||iV||i for which an algorithm achieves almost 
perfect index recovery (that is, at least 99% on average). 



Table[2]gives the average computational time for a single application of the algorithms to a dataset. 
As expected, the LP-based methods are significantly slower than SPA and XRAY; designing faster 
solvers is definitely an important topic for further research. Note that the Hottopixx model can be 
solved about ten times faster on average than the LP model (|10p . despite the only essential difference 
being the trace constraint tr(X) = r. It is difficult to explain this behaviour as the number of simplex 
iterations or geometry of the central path cannot easily be set in relation to the presence or absence 
of a particular constraint. 

Table [3] displays the index recovery robustness: For each algorithm and data model, the maximum 
noise level ||iV||i for which the algorithm recovered on average at least 99% of the indices corresponding 
to the columns of W . In all cases, the LP (|10p with p = 1 is on par or better than all other algorithms. 



6 Conclusion and Further Work 

In this paper, we have proposed a new more practical and more robust LP model for near-separable 
NMF which competes favorably with two state-of-the-art methods (outperforming them in some cases). 
It would be particularly interesting to investigate the following directions of research: 

• Implementation and evaluation of an algorithm to solve (|10p for large-sale real- world problems. 

• Improvement of the theoretical bound on the noise level for Algorithm [3] to extract the right set 
of columns of the input data matrix in case duplicates and near duplicates are present in the 
dataset (cf. Section [2.2p . 

• Design of practical and robust near-separable NMF algorithms. For example, would it be possible 
to design an algorithm as robust as our LP-based approach but computationally more effective 
(e.g., running in 0(mnr) operations)? 
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A Proof of Theorem Q] 

The next two lemmas are simple generalizations of [SJ Lemmas 2 & 3]. 

Lemma 1. Suppose M = M + N where ||M(:,j)||i = 1 for all j and \\N\\i < e < 1, and suppose X is 
a feasible solution of @ . Then, 

imii < i+^Y^f) and \\ M - MX h<t{jZ^j- 

Proof. First note that \\M\\ X = \\M + N\\ x < \\M\\i + \\N\\t < 1 + e. By the feasibility of X for ©, 
pe > \\M - MX\\ X > \\MX\\ X - \\M\\ X > \\MX\\ X - \\NX\\ X - (1 + e) > ||^||i - e||A-||i - 1 - e, 

hence ||X||i < 1 + e (f±§) , implying that ||JVX||i < \\N\\ X \\X\\ X < e (l + ^r^) • Therefore 



pe > ||M-MX||i = \\M + N- (M + N)X\\ X > \\M - MX\\ X -e-e ( 1 1 (/>+ 



1 - e 



from which we obtain ||M - MX||i < e (p + 2 + ^ffj = e (^J □ 

Lemma 2. Let M = M + N where ||M(:, j)||i = 1 /or aZZ j, admits a rank-r separable factorization 
WH with W K-robustly conical and [|JV||i < e < 1, and /ias t/ie /orm ([1]) mtt maxjj H'- < f3 < 1 and 
W, -ff > 0. Let a/so X 6e any feasible solution of ([3]) ; t/ien 



> 1 ( — J /or a// j swc/i t/iat M(:,j) = W(:,k) for some 1 < k < r . 

Proof. Let /C be the set of r indices such that M(:,JC) = W. Let also 1 < k < r and denote j = )C(k) 
so that M(:,j) = W(:,k). By Lemma [TJ 

||W(:,fc)-^X(:,j)||i< e (^). (11) 

Since H(k,j) = 1, 

WM(:,^=W(:,*;)H(fc,:)X(: s i)+17(: J 7i)ir(7i,:)X(:,j) 

= ft) j) + J)X(J , j)) + W(: 5 ft)y, 

where7J = {l,2,... J r}\{Jfe}, J = {1, 2, . . . , n}\{j} and y = H(K,:)X(:,j) > 0. We have 

r, = X(j,j) + j) < X(j,j) - X(j,j)) , (12) 



since \\H(k, J^lloc < /3 and ||X(:J)||i < 1 + ^ (Lemma!}. Hence 



1 - e 

(P+2)e 



W(:,k)-W(:,TZ): V 



1 — T] 

Combining Equations (jlljl . (11211 and (|13f> . we obtain 



>(l-r/)«. (13) 

l 



i-(x(i,i) +/3 (i + <4±*-A-aiA N )<i^+ 2 



1 — e // k \1 — e 

which gives, using the fact that k, /3 < 1, 



□ 
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Lemma 3. Let M = M+N where \\M(:, j)\\i = 1 for all j, admits a rank-r separable factorization WH 
and \\N\\\ < e, and has the form (Q]). Let K, be the index set with r elements such that M(:,/C) = W . 
Let also X* be an optimal solution of Q such that 

X*(k,k)>~f forallkeJC, (14) 

where < 7 < 1. Then, 

X*(j, j) < 1 - min (7, for all j £ K. 

Proof. Let X be any feasible solution of (J3j> satisfying ([14p . and a = min (7, 0. Let us show that the 
jth column of X for some j ^ fC can be modified as follows 

if i = j, 
if i E K, 
otherwise, 

while keeping feasibility First, aH(i,j) < 7 < X(i,i) for all i 6 K, hence the condition X(i,j) < 
X(i,i) for all i,j is satisfied while, clearly, < X(i,i) < 1 for all i. It remains to show that 
\\M(:,j) - MX(:,j)\\i< pe. By assumption, M(:,j) = WH(:,j) = aWH(:,j) + (1 - a)M(:,j) hence 

M(:,j) = a (M(:, j) + N(:,j)) + (1 - a)M(:,j) 

= a (WH(:,j) + N(:,j)) + (1 - a)M(:,j). 

This gives 

\\M(:,j) - MX(:,j)\\i = a\\M(:,j) + N(:,j) - (W + N(:,)C))H(:,j)\\i < 2ae < pe, 

since the columns of H sum to one, and ||iV||i < e. This result implies that any optimal solution X* 
satisfying (|14[) must satisfy X*(j,j) < 1 — a, otherwise we could replace the jth column of X* using 
the construction above and obtain a strictly better solution since the vector p in the objective function 
only has positive entries. □ 

Proof of Theorem{J\ Let X be an optimal solution of ([3]). Let us first consider the case e = 0, 
which is particular because it allows duplicates of the columns of W in the dataset and the value of p 
does not influence the analysis since pe = for any p > 0. Let denote 

K k = {j I M(:,j) = W(:,k)}, 

the set of indices whose corresponding column of M is equal to the fcth column of W. By assumption, 
k > hence for all 1 < k < r we have W(:,k) ^ cone(W(:, K,)) where K, = {1, 2, . . . , r}\{fc}. This 
implies that X^eK: fc X(J,j) > 1 for all k. Since we are minimizing a positive linear combination of the 
diagonal entries of X and assigning a weight of one to each cluster is feasible (see Equation [1]) , 
we have X]jGi<; fe -^0>i) = 1- Moreover, assigning all the weight to the index in /C^ with the smallest 
entry in p minimizes the objective function (and this index is unique since the entries of p are distinct). 
Finally, for all 1 < k < r, there exists a unique j such that M{:,j) = W(:,k) and X(j,j) = 1 which 
gives the result for e = 0. 

Otherwise e > and j3 < 1 . and the result follows from Lemmas [2] and [3) Let /C be the set of r 
indices such that M(:,/C) = W. By Lemma [21 we have 



X(k, k) > 1 - — ^— ) . for all k e A;. 



while, by Lemma [3l 



XVJ)<msK[l-£-^-^ (£±|) j, forall^/C. 
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Therefore, if 



1 2e fp + 2\ ^ ( p 2e fp + 2 

1 — A — [ - ] > / > max ( 1 — - 

min(l,p) 



K(l-P)\l-eJ V 2'k(1-/3) \l-e 



where f = 1 — min ^ ,p ' = max 1 — 0, then Algorithm [2] extracts the r indices corresponding to the 
columns of W. The above conditions are satisfied if 



k(1-/3) Vl-ey 2 K(l-/3)Vl-e/ 2 



that is, ^ < Taking 



^ k(1 — P) min(l, p) k(1 — f3) min(l, p) 1 — e 



gives the results since e < ~, Km < ttt f° r an Y P > hence > i. □ 



5(p + 2) p + 2 

5^2y ^ i for any p > hence ^ > 5. 

B Proof of Theorem [2] 

Proof of Theorem \^ Let us consider 

W = ( (i _!i) e T ),H=(l r pi r + ^j(Er ~ Ir)) , and TV = 0, 

where £ < /3 < 1 and is K-robustly conical with k > [9]. Let also p = (^ e ) where e is the vector 
of all ones and K is a large constant. The matrix 

x= ( (i-j&tH o ) 

is a feasible solution of ([3]) for any e < In fact, for all 1 < j < r, 

||M(:,i) - MX(:, = P ' |]M(:,j) - M(:, j + r)||x = pe, 

while it can be easily checked that X satisfies the other constraints. By [9J Lemma 7], for K sufficiently 
large, any optimal solution X* of ([3]) must satisfy 

min X*(k,k) < max X(k,k) = 1 



l<fc<r ' l<k<r ' k(1 — /3) ' 

(otherwise p T diag{X*) > p T diag(X) for K sufficiently large). For the columns of W to be extracted, 

0(1 
2 



one requires X*(k, k) > 1 — mm i 1 ' P ' > for all 1 < k < r hence it is necessary that 



pe min(l,p) k(1 — P) min(l, p) 

k(1-P) 2 ^ £< 2 p ' 

for Algorithm [2] to extract the right columns of M (that is, the first r ones). □ 
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C Proof of Theorem [3] 



Proof of Theorem^ The matrix X° from Equation is a feasible solution of ([3]); in fact, 

||M-MX°||i = \\M + N- (M + iV)X°||i < ||M-MX°||i + ||iV||i + ||iVA ||i < 2e, 

since M = MX , \\N\\i < e and ||iVX ||i < ||iV||i||X ||i < e as ||X°||i = 1. Therefore, since p = e, 
any optimal solution X* of ([3]) satisfies 

tr(X*) = p T diag{X*) < p T diag(X°) = r. 

The result then directly follows from [9, Theorem 5]. In fact, Algorithm [3] is exactly the same as [9j 
Algorithm 3] except that the optimal solution of ([3]) is used instead of ([2]) while [SJ Theorem 5] does 
not need the entries of p to be distinct and only the condition tr(X) < r is necessary. Note that [9j 
Theorem 5] guarantees that there are r disjoint clusters of columns of M around each column of W 
whose weight is strictly larger ^q-j-. Therefore, the total weight is strictly larger than r — > r — 1 

while it is at most r (since tr(A*) < r) implying that r = ^i = iX*(i,i) . □ 

D Proof of Theorem [H 

Proof of Theorem [^} In case j3 = 1 , e = and the proof is similar to that of Theorem [TJ the only 
difference is that the condition from Equation §5§ has to be used to show that no weight can be assigned 
to off-diagonal entries of the rows of an optimal solution of ([3]) corresponding to the columns of T. 
Otherwise (3 < 1 and there are no duplicate nor near duplicate of the columns of W in the dataset. 
Let assume without loss of generality that M has the form 

M = [T, W, WH'} + N, 

that is, the first t columns correspond to T and the r next ones to W . Let then X be an optimal 
solution of ©. 

Since [W, T] is K-robustly conical, Theorem [1] applies (as if the columns of T were not outliers) and, 
for all 1 < k < r + t, 

c , 1 

X(k,k) > 1 j- — — ,>-, 

k(1 -P)(l -e) 2 

while X(j,j) < K (i_ ; |)n_ e ) ^ \ f° r au J > r + ^ since e < 2o( n -i) wnere v = rnin(fv, 77, S). Therefore, 
only the first r + t indices can potentially be extracted by Algorithm [5j It remains to bound above 
(resp. below) the off-diagonal entries of the rows of X corresponding to T (resp. W). 
By Lemma [2] (see also [9J Lemma 2]), we have for all 1 < j < n 

\\M(:,j) - MXt.J)^ < -il- and |[X(:,j)[| a < 1 + 

1 — e 1 — e 

Using the fact that \W, T] is K-robustly conical, for all 1 < k < t, we have 

\\T(:,k) - MX(:,k)\\i > (1-X(k,k)) min \\T(:,k) -T(:,iC)x - Wy\\i > (1-X(k,k))n, 

implying that for all 1 < k < t 

since < 5 because e < Therefore, 

E^.*) s lW=.*)ll. - T^~ € + ^dj~~€) - why 
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as k, e < 1. Let t + 1 < j < n and 1 < k < t, we have 

\\M(:,j) - MX(:,j)\\i > minmin \\T(:, k) + T(:,£)y - Wx\\i > r,X(k,j), 

X y>0 

see Equation ([5]), which implies X(k,j) < ^n^i) ■ Hence, for all 1 < A; < t, we have 



,,,, 8e . 4e 8(n - l)e 1 
^ X(k,j) <(t- 1 — + n - r - i)- < A — < 



rj(l - e) ~ - e) ~ 2" 



since f = min(K, 5). By assumption, for each t+l<fc<i + r, there exists some j satisfying 
M(:,j) = WH(:,j) ?W{:,k) and 

min|[M(:,j) - W(:,£)x||i > 6, where £ = {1, 2, . . . , r}\{k}, 

x>0 



see Equation ([6]). For t + r < j < n, we have X(j,j) < ^jzmn^e) ' us denote = ^^gjp^ 



which is an upper bound for the total weight that can be assigned to the columns of M different from 
W and T. Then, using Equation (JSJ) , we have 



HMO^-MXO^Hx^l-^inin 



M(:,j) 



1 



WX(t + l:r + t,j) -Ty 



>(i-rt|i-SM u . 



This implies 



and 



1-A* 



X(k,j) > 1 



> 1 



> 1 



4e 



*(l- M )(l-e) 
n — r — i)e 



4e 



K(l-P){l-e) 5(1 -e) 
8(n - l)e ^ 1 



since /3 < 1 and e < 20( n -i) > an< ^ P ro °f i s complete. 



□ 
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